Chapter 4 Extra skills

In this section of my portfolio i will show you some of the extra skills I have developed during my data science minor.

4.1 Writing an introduction using Zotero for references

Writing research papers is a pretty important for the research field, so using references to other papers is crucial for writing a good introduction. below here you can see an introduction to my project about liquid biopsies, it makes use of multiple references.

Neuroblastoma is the fourth most common tumor in children and presents itself as either a low-risk neuroblastoma or a high-risk neuroblastoma. determining which risk neuroblastoma is present a tumor biopsy is necessary. (Weiser et al. 2019) however these biopsies are very invasive and can only be done a few times even tho the tumor keeps mutating. to counter this problem researchers have become more and more interested in liquid biopsies to be able to track the mutations in the tumor. this is possible because the tumor often excretes DNA into the blood stream which is then used for whole-exome sequencing (WES). these results can be compared with the DNA of the tumor to show how the tumor evolves over time (Chicard et al. 2018). because the tumor cells have different properties depending on what part of the tumor they are on it is difficult to show all the mutations based on 1 tumor biopsy as that can have a bias for only that specific part of the tumor where the biopsy was taken from. liquid biopsies also help with this problem as they sequence all the DNA that has been excreted by the tumor, this makes it possible to spot different mutations in both tumor DNA and cell free DNA (cf-DNA). (Van Paemel et al. 2022)
The type of mutations that get the most attention are the copy number variations/aberrations (CNV/CNA) these CNV’s can either be very small with just a few kilobases or very big where they cover the whole chromosome. the CNV’s are very importants as they can give an identification on how pathogenic the tumor is based on the genes it contains, the position of the CNV and the size of the CNV (Riggs et al. 2020).
Some hospitals sadly don’t have easy ways to analyse the data from all these patients because they do not have the experience with data analysis. this makes the process of analyzing the data a slow and tedious task. even tho it would be extremely beneficial for the hospitals without data scientists to have the programs available to analyse these results quickly (Valsesia et al. 2013), this takes away the time consuming task of having to analyze every file manually which gives them time to focus on more important things like treating the patient. sharing these results to other hospitals is equally as important because there is not nearly enough data available to say with certainty how dangerous certain tumors are and if the tumors have been fully removed. With all the data combined the research towards liquid biopsies can evolve quickly making diagnoses easier and more reliable
In this project we will analyse the tumor DNA and the cfDNA created by WES and will make it reproducible so that Princess maxima centre can easily analyse the data for all their patients.
to accomplish this we will mostly focus on:
– giving all the CNV’s different ID’s so they can be easily distinguished from each other.
– showing which cytogenetic band the CNV falls in to.
– making an interactive plot to look up genes easily.
– automatically filtering genes that indicate high risk neuroblastoma’s.
– making a high throughput version so that it will analyse multiple datasets at the same time without having to manually insert all the data sets – getting these results quickly and easily so the researcher does not have to focus on how to analyze the data.

4.2 creating paramaters for different data inputs

to show my ability to use paramaters I will be using data from the ECDC. the data is available in this repository under “data/COVID_cases_31_05_2022”

# loading in data
cases <- read.csv("data/COVID_cases_31_05_2022.csv")

# filtering the params used
cases_filtered <- cases %>% dplyr::filter(countriesAndTerritories == params$country, year == params$year, month >= params$period_start, month <= params$period_end) 

# telling R the dateRep column is a date
cases_filtered$dateRep <- as.Date(cases_filtered$dateRep, format = "%d/%m/%Y")

# making a graph for cases
cases_graph <- cases_filtered %>%
  ggplot(aes(x = dateRep, y = cases)) +
  geom_point(size = .5) +
  geom_line() +
  labs(title = paste("Covid related cases from month", params$period_start, "to", params$period_end, "in", params$year, "for", params$country),
       x = "Month",
       y = "Covid related cases") +
  theme_classic()

ggplotly(cases_graph)
# making a graph for deaths
deaths_graph <- cases_filtered %>%
  ggplot(aes(x = dateRep, y = deaths)) +
  geom_point(size = .5) +
  geom_line() +
  labs(title = paste("Covid related deaths from month", params$period_start, "to", params$period_end, "in", params$year, "for", params$country),
       x = "Month",
       y = "Covid related deaths") +
  theme_classic()

ggplotly(deaths_graph)

If you want to recreate these graphs with different params clone this repository and use put this command in the console (with your own params ofcourse): bookdown::render_book(params = list(country = "Netherlands", year = 2021, period_start = 5, period_end = 10))

4.3 Using SQL to store data

Databases are an important feature of data science (it is even in the name), thats why being able to request data from a database is equally as important as actually analyzing the data.

First we need to make a new database, this has been done with the use of DBeaver and the code for creating the database can be found below.

SQL script of database creation

after creating the database we need to make a connection with DBeaver.
to reproduce this:
* make a new database in dbeaver called “dengue_flu_data”.
* insert your own password in the yaml header of this file in my repository.

library(DBI)

con <- dbConnect(RPostgres::Postgres(),
                 dbname = "dengue_flu_data",
                 host = "localhost",
                 port = "5432",
                 user = "postgres",
                 password = params$password)
library(tidyverse)
library(dslabs)
library(car)

gapminder <- gapminder
dengue_data <- read_csv("data/dengue_data.csv", skip = 11)
flu_data <- read_csv("data/flu_data.csv", skip = 11)

dengue_data %>% head(5)
## # A tibble: 5 x 11
##   Date       Argentina Bolivia Brazil India Indonesia Mexico Philippines
##   <date>         <dbl>   <dbl>  <dbl> <dbl>     <dbl>  <dbl>       <dbl>
## 1 2002-12-29        NA   0.101  0.073 0.062     0.101 NA              NA
## 2 2003-01-05        NA   0.143  0.098 0.047     0.039 NA              NA
## 3 2003-01-12        NA   0.176  0.119 0.051     0.059  0.071          NA
## 4 2003-01-19        NA   0.173  0.17  0.032     0.039  0.052          NA
## 5 2003-01-26        NA   0.146  0.138 0.04      0.112  0.048          NA
## # ... with 3 more variables: Singapore <dbl>, Thailand <dbl>, Venezuela <dbl>
flu_data %>% head(5)
## # A tibble: 5 x 30
##   Date       Argentina Australia Austria Belgium Bolivia Brazil Bulgaria Canada
##   <date>         <dbl>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
## 1 2002-12-29        NA        NA      NA      NA      NA    174       NA     NA
## 2 2003-01-05        NA        NA      NA      NA      NA    162       NA     NA
## 3 2003-01-12        NA        NA      NA      NA      NA    174       NA     NA
## 4 2003-01-19        NA        NA      NA      NA      NA    162       NA     NA
## 5 2003-01-26        NA        NA      NA      NA      NA    131       NA     NA
## # ... with 21 more variables: Chile <dbl>, France <dbl>, Germany <dbl>,
## #   Hungary <dbl>, Japan <dbl>, Mexico <dbl>, Netherlands <dbl>,
## #   `New Zealand` <dbl>, Norway <dbl>, Paraguay <dbl>, Peru <dbl>,
## #   Poland <dbl>, Romania <dbl>, Russia <dbl>, `South Africa` <dbl>,
## #   Spain <dbl>, Sweden <dbl>, Switzerland <dbl>, Ukraine <dbl>,
## #   `United States` <dbl>, Uruguay <dbl>

As we can see both the flu and dengue data are not tidy ( > 1 observations per row). This can easily be fixed with pivot_longer()

dengue_tidy <- dengue_data %>% pivot_longer(cols = Argentina:Venezuela,
                                            names_to = "country",
                                            values_to = "Dengue_cases")
dengue_tidy %>% head(5)
## # A tibble: 5 x 3
##   Date       country   Dengue_cases
##   <date>     <chr>            <dbl>
## 1 2002-12-29 Argentina       NA    
## 2 2002-12-29 Bolivia          0.101
## 3 2002-12-29 Brazil           0.073
## 4 2002-12-29 India            0.062
## 5 2002-12-29 Indonesia        0.101
flu_tidy <- flu_data %>% pivot_longer(cols = Argentina:Uruguay,
                                      names_to = "country",
                                      values_to = "Flu_cases")
flu_tidy %>% head(5)
## # A tibble: 5 x 3
##   Date       country   Flu_cases
##   <date>     <chr>         <dbl>
## 1 2002-12-29 Argentina        NA
## 2 2002-12-29 Australia        NA
## 3 2002-12-29 Austria          NA
## 4 2002-12-29 Belgium          NA
## 5 2002-12-29 Bolivia          NA
# data is now tidy

If we want to eventually merge these 3 data files together we need to make the date and country variables equal:
* date from flu and dengue data need to be split into year, month, day.
* country needs to be of class factor

dengue_tidy$country <- as.factor(dengue_tidy$country) 
flu_tidy$country <- as.factor(flu_tidy$country)

dengue_tidy <- dengue_tidy %>% separate(Date, into = c("year", "month", "day"), convert = T, sep = "-")
flu_tidy <- flu_tidy %>% separate(Date, into = c("year", "month", "day"), convert = T, sep = "-")

We will store this data so we always have our tidy versions by hand

dengue_tidy %>% write.csv("data/denguetidy.csv")
dengue_tidy %>% write_rds("data/denguetidy.rds")

flu_tidy %>% write.csv("data/flutidy.csv")
flu_tidy %>% write_rds("data/flutidy.rds")

gapminder %>% write.csv("data/gapminder.csv")
gapminder %>% write_rds("data/gapminder.rds")

Now we can insert these dataframes into SQL

dbWriteTable(con, "dengue", dengue_tidy)
dbWriteTable(con, "flu", flu_tidy)
dbWriteTable(con, "gapminder", gapminder)

To check if everything imported correctly we will go to DBeaver and inspect the data.

The imported dengue data The imported flu data The imported gapminder data

We can also get the datasets back and check them using R

dengueSQL <- dbReadTable(con, "dengue")
fluSQL <- dbReadTable(con, "flu")
gapminderSQL <- dbReadTable(con, "gapminder")

fluSQL %>% head(5)
##   year month day   country Flu_cases
## 1 2002    12  29 Argentina        NA
## 2 2002    12  29 Australia        NA
## 3 2002    12  29   Austria        NA
## 4 2002    12  29   Belgium        NA
## 5 2002    12  29   Bolivia        NA
gapminderSQL %>% head(5)
##               country year infant_mortality life_expectancy fertility
## 1             Albania 1960           115.40           62.87      6.19
## 2             Algeria 1960           148.20           47.50      7.65
## 3              Angola 1960           208.00           35.98      7.32
## 4 Antigua and Barbuda 1960               NA           62.97      4.43
## 5           Argentina 1960            59.87           65.39      3.11
##   population          gdp continent          region
## 1    1636054           NA    Europe Southern Europe
## 2   11124892  13828152297    Africa Northern Africa
## 3    5270844           NA    Africa   Middle Africa
## 4      54681           NA  Americas       Caribbean
## 5   20619075 108322326649  Americas   South America

It seems like inserting it into DBeaver and pulling it back changes the data classes for every variable.
This is not a big problem, we just have to make sure to change the classes of certain values whenever we import something from DBeaver.
we can now combine these 3 dataframes into 1 big dataframe based on country and year.

flu_dengue_combined <- full_join(dengueSQL, fluSQL, by = c("country", "year", "month", "day"))

# lets check how many years all 3 dataframes are covering

flu_dengue_combined$year %>% min()
## [1] 2002
flu_dengue_combined$year %>% max()
## [1] 2015
# flu and dengue have a range from 2002 to 2015

gapminderSQL$year %>% min()
## [1] 1960
gapminderSQL$year %>% max()
## [1] 2016
# gapminder has a range from 1960 to 2016.
# we dont need all those years so we will cut those out

gapminder_filtered <- gapminderSQL %>% filter(year >= 2002) %>% filter(year <= 2015)

# lets update the gapminder data in SQL so it only contains this dataframe instead of the big one because it is not required

dbWriteTable(con, "gapminder", gapminder, overwrite = T)

# now we can combine all the data frames

combined_all <- full_join(flu_dengue_combined, gapminder_filtered, by = c("country", "year"))
dbWriteTable(con, "gap_flu_den", combined_all)

Now that we have a table with all the data combined we can ask SQL to extract specific data to do some analysis on.

Lets say we want to know if there is a difference in the amount of flu cases of brazil and argentina in 2007.

# getting all the relevant data

arg_bra_flucases <- dbGetQuery(con, 
                                  "SELECT 
                                      \"month\",
                                      country, 
                                      gfd.\"Flu_cases\", 
                                      population
                                   FROM 
                                      gap_flu_den gfd
                                   WHERE 
                                      country IN (\'Argentina\', \'Brazil\') AND
                                      \"year\" = 2007;")

# because brazil has a much higher population we will use the cases per 100.000 citizens.

arg_bra_flucases <- arg_bra_flucases %>% mutate(per_100.000 = Flu_cases * 100000 / population)

summary_arg_bra_flu <- arg_bra_flucases %>%
  group_by(month, country) %>%
  summarise(total = sum(per_100.000))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
summary_arg_bra_flu %>%
  ggplot(aes(x = month.name[month], y = total, colour = country)) +
  geom_point() +
  geom_path(aes(group = country)) +
  labs(title = "Flu cases per 100.000 citizens in argentina and brazil",
       subtitle = "Data from 2007",
       x = "Month of the year",
       y = "Cases per 100.000 citizens") +
    scale_x_discrete(limits = month.name) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 60, vjust = 0.7))

# on first glance Argentina seems to have more cases per 100.000 citizens, but just to make sure we will do a two sample t-test

summary_arg_bra_flu %>%
  group_by(country) %>%
  summarise(p.value.normality = shapiro.test(total)$p.value)
## # A tibble: 2 x 2
##   country   p.value.normality
##   <chr>                 <dbl>
## 1 Argentina             0.210
## 2 Brazil                0.312
# both have p > 0.05, so they are normally distributed, now we can continue with levennes test.

leveneTest(summary_arg_bra_flu$total, as.factor(summary_arg_bra_flu$country), center = mean)$P
## [1] 0.0001063121           NA
# p < 0.05, so they do not have equal variance. we can now perform the t test

t.test(formula = summary_arg_bra_flu$total ~ summary_arg_bra_flu$country,
       paired = F, var.equal = F)$p.value
## [1] 0.0003999675
# p < 0.001, There is a significant difference between the total amount of cases per 100.000 citizens in Argentina and Brazil.

This had some interesting results. lets try something different this time.
lets see if some south-east asian countries have reduced their dengue infections between 2002 - 2015.

SEAsia_dengue_cases <- dbGetQuery(con, 
                                  "SELECT 
                                      gfd.\"year\",
                                      gfd.\"Dengue_cases\",
                                      country
                                   FROM 
                                      gap_flu_den gfd
                                   WHERE 
                                      country IN (\'Singapore\', \'Indonesia\', \'Thailand\', \'Philippines\')")

summary_SEAsia_den_cases <- 
  SEAsia_dengue_cases %>%
  group_by(year, country) %>%
  summarize(total = sum(Dengue_cases))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
summary_SEAsia_den_cases %>%
  ggplot(aes(x = year, y = total, colour = country)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2002, 2015, by = 2)) +
  labs(title = "Total dengue cases per year",
       x = "Year",
       y = "Total cases per year") +
  theme_classic()
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 9 row(s) containing missing values (geom_path).

The dengue cases seem to stay mostly stagnant with a few peaks here and there for every country, but in 2015 all dengue cases seem to have dropped which could be a good sign.
Now lets take a look at one last graph. It is always said that the flu peak is always during winter because the cold makes it easier for the influenza virus to infect people. lets check this for ourselves by checking the flu incidence for the year 2009 in 5 northern hemisphere countries.

flu_incidence_northern <- dbGetQuery(con,
                                     "SELECT
                                        country,
                                        \"month\",
                                        gfd.\"Flu_cases\"
                                      FROM 
                                        gap_flu_den gfd
                                      WHERE 
                                        country IN (\'Netherlands\', \'Norway\', \'France\', \'Sweden\', \'Switzerland\') AND
                                        \"year\" = 2009")

summary_flu_inc_north <- flu_incidence_northern %>%
  group_by(month, country) %>%
  summarize(total = sum(Flu_cases))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
summary_flu_inc_north %>%
  ggplot(aes(x = month.name[month], y = total, group = country, fill = country)) +
  geom_col(position = position_dodge()) +
  labs(title = "Flu incidence for different countries",
       subtitle = "Data from 2009",
       x = "Month",
       y = "Flu incidence") +
  scale_x_discrete(limits = month.name) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 60, vjust = 0.8))

As we can see in this graph, all

References

Chicard, Mathieu, Leo Colmet-Daage, Nathalie Clement, Adrien Danzon, Mylène Bohec, Virginie Bernard, Sylvain Baulande, et al. 2018. “Whole-Exome Sequencing of Cell-Free DNA Reveals Temporo-spatial Heterogeneity and Identifies Treatment-Resistant Clones in Neuroblastoma.” Clinical Cancer Research 24 (4): 939–49. https://doi.org/10.1158/1078-0432.CCR-17-1586.
Riggs, Erin Rooney, Erica F. Andersen, Athena M. Cherry, Sibel Kantarci, Hutton Kearney, Ankita Patel, Gordana Raca, et al. 2020. “Technical Standards for the Interpretation and Reporting of Constitutional Copy-Number Variants: A Joint Consensus Recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen).” Genetics in Medicine: Official Journal of the American College of Medical Genetics 22 (2): 245–57. https://doi.org/10.1038/s41436-019-0686-8.
Valsesia, Armand, Aurélien Macé, Sébastien Jacquemont, Jacques S. Beckmann, and Zoltán Kutalik. 2013. “The Growing Importance of CNVs: New Insights for Detection and Clinical Interpretation.” Frontiers in Genetics 4: 92. https://doi.org/10.3389/fgene.2013.00092.
Van Paemel, Ruben, Charlotte Vandeputte, Lennart Raman, Jolien Van Thorre, Leen Willems, Jo Van Dorpe, Malaïka Van Der Linden, et al. 2022. “The Feasibility of Using Liquid Biopsies as a Complementary Assay for Copy Number Aberration Profiling in Routinely Collected Paediatric Cancer Patient Samples.” European Journal of Cancer (Oxford, England: 1990) 160 (January): 12–23. https://doi.org/10.1016/j.ejca.2021.09.022.
Weiser, Daniel A., Diana C. West-Szymanski, Ellen Fraint, Shani Weiner, Marco A. Rivas, Carolyn Wei Ting Zhao, Chuan He, and Mark A. Applebaum. 2019. “Progress Toward Liquid Biopsies in Pediatric Solid Tumors.” Cancer Metastasis Reviews 38 (4): 553–71. https://doi.org/10.1007/s10555-019-09825-1.